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Abstract 

This paper deals with Gibbs samplers that include high dimensional conditional Gaussian distributions. It proposes 
an efficient algorithm that avoids the high dimensional Gaussian sampling and relies on a random excursion along 
a small set of directions. The algorithm is proved to converge, i.e. the drawn samples are asymptotically distributed 
according to the target distribution. Our main motivation is in inverse problems related to general linear observation 
models and their solution in a hierarchical Bayesian framework implemented through sampling algorithms. It finds 
direct applications in semi-blind / unsupervised methods as well as in some non-Gaussian methods. The paper provides 
an illustration focused on the unsupervised estimation for super-resolution methods. 


I. Introduction 

A. Context and problem statement 

Gaussian distributions are common throughout signal and image processing, machine learning, statistics,.. .being 
convenient from both theoretical and numerical standpoints. Moreover, they are versatile enough to describe very 
diverse situations. Nevertheless, efficient sampling including these distributions is a cumbersome problem in high 
dimensions and the current paper deals with this question. 

Our main motivation here is in inverse problems Q>@ and the methodology resorts to a hierarchical Bayesian 
strategy, numerically implemented through Monte-Carlo Markov Chain and more specifically the Gibbs Sampler 
(GS). Indeed, consider the general linear direct model y = Ax -|- n, where y, n and x are the observation, the 
noise and the unknown image and A is a given linear operator. Consider, again, two independent prior distributions 
for n and x that are Gaussian conditionally to a vector 9 , namely the hyperparameter vector. The estimation of 
both X and 9 relies on the sampling of the joint posterior p(x, 0|y), and this is the core question of the paper. 
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It commonly requires the handling of the high dimensional conditional posterior p(x|0,y) that is Gaussian with 
given mean m and precision Q. 

The framework directly covers non-stationary and inhomogeneous Gaussian models for image and noise. The 
paper has also fallouts for non-Gaussian models based on conditionally Gaussian ones involving auxiliary/latent 
variable^ (e.g., location or scale mixtures of Gaussian) for edge preserving and for sparse signals 

jT). It also includes other hierarchical models |(8), @ involving labels for inversion-segmentation. This framework 
also includes linear variant direct models and some non-linear direct models, based on conditional linear ones, 
e.g. bilinear or multilinear. In addition, it covers a majority of current inverse problems, e.g. unsupervised Q and 
semi-blind |Tg, by including hyperparameters and acquisition parameters in the vector 9. 

Large scale Gaussian distributions are also useful for Internet data processing, e.g. to model social networks 
and to develop recommender systems GD- They are also widely used in epidemiology and disease mapping eg. 
GD as they provide a simple way to include spatial correlations. The question is also in relation with spatial 
linear regression with (smooth) spatially varying parameters G!)- In these cases the question of efficient sampling 
including Gaussian distributions in high dimensions becomes crucial and it is all the more true in the “Big Data” 
context. 

B. Existing approaches 

The difficulty is directly related to handling the high-dimensional precision Q. The factorization (Cholesky, square 
root,...), diagonalization and inversion of Q could be used but they are generally infeasible in high dimensions 
due to both computational cost and memory footprint. Nevertheless, such solutions are practicable in two famous 
cases. 

• If Q is circulant or circulant-block-circulant an efficient strategy p3| , flh) relies on its diagonalization 
computed by FFT. More generally, an efficient strategy exists if Q is diagonalizable by a fast transform, 
e.g. discrete cosine transform for Neumann boundary conditions GD’ Gl- 

• When Q is sparse, a possible strategy | [T3| , fl^ , pO) relies on a Cholesky decomposition and a linear system 
resolution. Another strategy is a Gibbs sampler m that simultaneously updates large blocks of variables. 

In order to address more general cases, solutions founded on iterative algorithms for objective optimization or linear 
system resolution have recently been proposed. 

1) An efficient algorithm has been proposed by several authors GD’ GD’ GD’ G3 (previously used in 
applications Q, pO)). It is founded on a Perturbation-Optimization principle: adequate stochastic perturbation 
of a quadratic criterion and optimization of the perturbed criterion. However, in order to obtain a sample from 
the right distribution, an exact optimization is needed, but practically an empirical truncation of the iterations 
is implemented, leading to an approximate sample. |[24) introduces a Metropolis step in order to asymptotically 


*It is based on the fact that for a couple of random variables (U, V), the conditional law for U\V is Gaussian and the marginal law for U 
is non-Gaussian. A famous example is a Gaussian variable with precision under a gamma law: the resulting marginal follow a Student law. 
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retrieve an exact sample and then to ensure, in a global MCMC procedure, the convergence to the correct 
invariant distribution. 

2) In p^ , the authors propose a Conjugate Direction Sampler (CDS) based on two crucial properties: (i) a 
Gaussian distribution admits Gaussian conditional distributions and {ii) a set of mutually conjugate directions 
w.r.t. Q is available. The key point of the algorithm is to sample along these mutually conjugate directions 
instead of optimize as in the classic Conjugate Gradient optimization algorithm. 

In the first case, the only constraint on Q is that a sample from Af{0, Q) must be accessible, which is often the 
case in inverse problem applications. In the second case, Q must have only distinct eigenvalues to make the CDS 
give an exact sample. Otherwise it leads to an approximate sample as described in p^ . 

The proposed algorithm uses the same approach as the CDS and extends the efficiency to, theoretically, any 
matrix Q. 

C. Contribution 

The existing methods described above and the proposed one are both founded on a Gibbs Sampler. However, the 
existing ones attempt to sample the high dimensional Gaussian component x S whereas the proposed method 
does not. Our main contribution is to avoid the high dimensional sampling and only requires small dimensional 
ones. More precisely, given a subset D C R^, the keystone of the advance is to sample the sub-component of x 
according to the subset D. It must be sampled under the appropriate conditional distribution 7r(x£)|x\£), 0), with 
the decomposition x = The algorithm takes advantage of the ease of calculating the conditional pdf of 

a multivariate Gaussian, when D is appropriately built, as explained in section |I^ These ideas are strongly related 
to different existing works. 

• If the subset D is composed of only one direction in the canonical coordinates, the algorithm amounts to a 
pixel-by-pixel GS ||^. 

• The marginal chain x*^‘) can also be viewed as the one produced by a specific random scan sampler 
The random scans are related to the random choice of D, depending on the current value 

• Other algorithms based on optimization principles f2E\ , p0| aim at producing a complete optimization. On 
the contrary, in essence, the proposed approach only requires a few steps of the optimization process. 

• A similar idea is at work in Hamiltonian (or Langevin) Monte Carlo |3T)-|g (see also lig): the proposal 
law takes advantage of an ascent direction of the target to increase the acceptation probability. Here, the exact 
distribution is sampled, so the proposal is always accepted. 

However, to our knowledge, the proposed algorithm does not directly join the class of existing strategies. One 
contribution of this paper is to give sufficient assumptions for convergence, i.e. the samples are asymptotically 
distributed according to the joint pdf p(x, 6). 
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D. Outline 


Subsequently, Section [^presents the proposed algorithm and section Ill gives an illustration through an academic 
problem in super-resolution. Section [^presents conclusions and perspectives. 


II. Gradient Scan Gibbs Sampler 

In this section we describe the proposed algorithm: a Gibbs sampler with a high dimensional conditional Gaussian 
distribution. The objective is to generate samples from a joint distribution p(x, 6), where x G is highly 
dimensional and p(x|0) is a Gaussian distribution A/^(me, Qg ^): 

p(x|0) = (27r)“^/^(det exp - Je(x) (1) 

with the potential Jg dehned as: 

Je(x) = i(x-me)*Qe(x-me). (2) 

All the other variables of the problem are grouped into 9 G Q and we assume that the sampling from p{9\x.) is 
tractable (directly or with several steps of the Gibbs sampler, including Metropolis-Hastings steps). 


A. Preliminary results 

This section presents classic definitions and results, mostly based on p5) , needed to provide convergence proof 
and links between matrix factorization and optimization / sampling procedures. 

Consider Q a N x N symmetric dehnite positive matrix. 

Definition 1. A set {d.„, n = 1,.. ., N} of non-zero vectors in R^ such that: Q d^ = 0 for n,m = 1,..., N, n 
m is said mutually conjugate w.r.t. Q. A 


A mutually conjugate set {di,..., d^r} w.r.t. Q is a basis of R", then, for all x G R”: 


N 


X = ^ a„d„ with = 


dnQx 
d)jQd„ ■ 


So, if X ^ Af{m, Q is a Gaussian random vector with mean m and precision Q, then the are also Gaussian: 


d)jQm 1 

d);^ ’ dj;^ 

and reciprocally if the Q!„ are distributed under ([^ then x ^ A/^(m, Q^^). 

In particular, let x° G R^ be a “current” point and di G R^ a given “direction”. One can find d 2 ,..., d^v 
that {di,... jd^v} is mutually conjugate w.r.t. Q and x° writes: 

N 

x° = d„. 




(3) 

such 


Consider now the -dimensional subset 
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N 


D{x°) = < ^a„d„,a„ e R, n < Nd, an = a°, n > No 


^ n—1 


Nd 


|x° + ^(a„ - a°)d„, {ai,...,aNo) eR^°| 


n—1 


We are interested in the conditional pdf p(x|x € D(x^)). The following result and its proof can be found in p5) . 

Proposition 1. A sample x according to p(x|x G D(x^)) can be obtained by: 

1) sample independently the set (5i,..., a^^) with: 

'd^Q(x°-m) _ 1 


'JV 


d^,Qd„ ’ dt,Qd, 


n = 1,... ,Nd 


2) compute x = x° — d 


B. Gradient Scan Gibbs Sampler (GSGS) 

In the following we propose a Gibbs sampling algorithm in order to sample the joint probability p{x,6). The 
principle is to sample, at each iteration of the Gibbs sampler, only Njy directions of x instead of sampling the whole 
high dimensional variable. The chosen first direction of the set D will be the gradient of the potential of p(x|0), 
with a stochastic perturbation to ensure, in the general case, the convergence of the resulting Markov chain. The 
following directions are chosen so as to get a mutually conjugate subset with respect to the precision of p(x|0). 

We call our proposed algorithm the Gradient Scan Gibbs Sampler (GSGS) which is described by Algorithm 
In this algorithm the chosen first sampling direction di is given by the gradient of the potential of p{x\6), with 
an additional random perturbation e that follows a probability density p{£). In fact, we expect the gradient to be a 
good direction towards regions of high probabilities. Also, the gradient is easily computable and so gives an easy 
mle to sample from any current point x. Moreover, the other conjugate directions are iteratively computable as 
described in the Conjugate Direction Sampling (CDS) algorithm p5| used to get an approximated sample from a 
Gaussian distribution. In fact, the GSGS is embedding steps of the CDS in a global Gibbs sampler. 

The objective is now to study the convergence properties of the GSGS. We begin with two classic results. 

• If the Markov chain is aperiodic, ())—irreducible for some nonzero measure and has an invariant probability 
TT, then it converges to tt from 7r-almost every starting point (cf Theorem 4.4 of p^). 

• Moreover, if the Markov chain is Harris recurrent, then it converges to tt from all starting point |36|, m 
The Harris recurrence of Gibbs samplers, or more generally Metropolis-within-Gibbs samplers is well studied in 
HZ)- In particular, the Theorem 12 and Corollary 13 of p7| ensures that if the Markov chain produced by the 
GSGS is irreducible then it is Harris recurrent. Consequently, in the following we focus on showing that the Markov 
chain is aperiodic, irreducible and with stationary distribution p{x,6). 


An all the paper we will consider (f> as the Lebesgue measure and we will omit it for simplicity. 
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Algorithm 1 : Gradient scan Gibbs sampler (GSGS). 

Define an initial point a number Nd and a stopping criterion. Iterate . 


1: sample 0*-*^ ^ 

2: set Qt = Qg(t) and mt = m^ct), and compute the gradient g = = Qt(x(*“^^ — mj) 

3: sample a perturbation e ~ p{e) 

4: compute a set of mutually conjugate directions (di,..., d^r^) w.r.t. Qt such that 

di = g + e 


5: sample independently the set (ai ,... ,aNu) with; 


'AA 


d*„Qtd„ 


6: compute x(‘) = X^* - Yln=l d„ 

7: t ^t + 1. 

until the stopping criterion is reached. 


1 

d^Qtd„ 


n < Njy 


It is trivial to see that the Markov chain (x*^*\ produced by the GSGS, is aperiodic since for any non- 

negligible subset A G including P(x^*) € A) > 0. The existence of an invariant probability and the 

irreducibility can be shown by thinking of a random scan Gibbs sampling for the marginal component (x^^^jog. 

Proposition 2. The Markov chain produced by Algorithm^admits p(x, 0) as an invariant distribution, even without 
perturbations of the gradient direction (i.e. £ = Oj. 

Moreover, if the density p^e) is supported on R^, the Markov chain produced by Algorithm is irreducible, and 
therefore its law converges to p(x, 0). 

Proof: see appendix [A] ■ 

The Proposition 1^ then shows that the joint probability p(x, 6) remains an invariant distribution in the limit case 
where the first direction di is exactly the gradient of p(x|0), without random perturbation. However the perturbation 
is needed to ensure the irreducibility (and then the convergence) of the chain. 

If the gradient is not perturbed, the mutually conjugate set D is then given by a deterministic function of 0*^*^ 
and In this case, we need more assumptions to ensure the Markov chain to be irreducible. For example, we 

can have the following result. 

Proposition 3. Suppose the following conditions are satisfied: 

H-1 The function 9 i—> Qg is continuous 

H-2 V(x, 0) S R^ X 0 and'ir > 0, P(,B(0,r)|x) > 0, with B{9,r) the ball in 0, centered in 6, of radius r. 
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H-3 Vx G 30 G 0 such as: 

H-3.1 Qe has N distinct eigenvalues, 

H-3.2 X — mg is not orthogonal to any eigenvector of Qe, 

Then the Markov chain produced by Algorithm^without the perturbation step 3 (e = 0) is irreducible. 


Proof: see appendix [B] ■ 

The conditions described in Proposition are very restrictive and, in particular, condition H-3.1 is difficult, if 
not impossible, to prove in practice. This condition ensures that every non-negligible subset of R^ can be reached 


with a non-zero probability. It can be interpreted in the framework of Krylov spaces as in |26|. For example, if 
there is t such as the Krylov space 


(Qeco, ) := span x^*),..., 


r(‘) 


is of rank N then the Markov chain is irreducible. This condition can be weakened in our case because the Gaussian 
parameters mg(t) and Qg(t) are changing since 9 is changing at each iteration of the Gibbs sampler. Therefore a 
sufficient condition to ensure the irreducibility of the chain can be expressed as follows; 


Proposition 4. If there is T > N such as the union of Krylov spaces 

uLi(Qe ^ ) U (q(‘\ m'*’ ) 

is of rank N then the Markov chain built by the GSGS without perturbation of the gradient is irreducible. 

Proof: The condition implies that for any non-negligible subset A C R'^, P (x(^) G A|x(°)) > 0, which 
ensures the irreducibility. ■ 

The issue of determining general conditions, as in Proposition is an open problem at this time. The fact that 
the condition described in Proposition]^ is satisfied, highly depends on the model’s characteristics. That is why the 
GSGS (with the random perturbation step 3) is the one that ensures, in all cases, the convergence of the Markov 
chain to the joint distribution p(x, 0). 

The presented results do not allow us to get any convergence rate of the Markov chain. The latter is, in fact, very 
important to ensure in practice the efficiency of the estimators produced by simulations in finite time. In particular, 
the geometric ergodicity p8) is a very well known property that gives a Central Limit Theorem and ensures the 
Markov chain to quickly converge and give estimations of standard errors. However the Algorithm [T] aims to be 
general while the precise study of geometric convergence (especially to quantify the convergence rate) would need 
to specify the distributions on the parameters 6 and on the perturbation e. At this time, only weak assumptions are 
considered on these probabilities and the next section discusses about the different choices of p{e) from a feasibility 
point of view. 


September 14, 2015 


DRAFT 



8 


C. Choice of p[e) 


As previously specified, the only condition to ensure the convergence of the GSGS in the general case, is to 
choose a distribution ^(e) supported in In practice we also expect a sample from p{e) to be easily accessible. 
A natural choice is the Gaussian iid distribution M{0,1n)7 T-n being the N x N identity matrix. This was already 
studied in | [^ in the case of only sampling from a Gaussian distribution p(x) and where results are shown in small 
dimensions. 

Our empirical studies in high dimension (one example is shown in section m incited us to choose the Gaussian 
distribution JV{0, Qe), when it is possible. The sampling from this distribution may actually be easily computable, 
provided that Qg has, for example, the specific factorization form described in pO) : 

K 

fc=i 


In this case, the sampling from A/^(0, Qe) is easily computable by using the Perturbation Optimization (PO) algorithm 
ij^. The latter consists in (i) randomly modifying the potential Je(x) to get a perturbed potential Jq and (ii) 
optimizing Jg. The first step of this optimization procedure consists in computing the gradient VJe and it is trivial 
to show that it can be decomposed: W Jg{x.) = WJg{x) + e, with e ~ Qe)- Therefore, the perturbed gradient 
di of the GSGS, with a random perturbation e ~ Qe), can be obtained by using the PO algorithm truncated 
to one step of the optimization procedure. 

Although this choice is empirical, at this time, we may propose some intuition to recommend, when it is possible, 
the distribution JV{0, Qe). The first direction di is related to the gradient of Jg, in accordance with the objective 
to get a direction towards regions of high probability. This gradient is mostly driven by the highest eigenvalues of 
Qg. The perturbation e is only needed to ensure the GSGS convergence, but the objective is to keep a direction 
towards high probability regions. The sampling from Af{Q, Qg) seems to be a good compromise: it gives values of 
£ mostly driven by the highest eigenvalues of Qe and then the resulting direction di still continues to encourage 
the exploration space of high probability. 

We may also notice that some relaxations of the GSGS are possible, following classic arguments of a random 
scan Gibbs sampling. For example, it is not necessary to sample the perturbation from p(e) at each iteration, it is 
sufficient to do this an infinite number of times to ensure the chain to be irreducibl^ As we will see in section 


III a low frequency sampling of e can improve the algorithm’s efficiency. 


III. Unsupervised super resolution as a large scale problem 
A. Problem statement 

The paper details an application of the proposed GSGS to a super-resolution problem (identical to the one 
presented in 13§, p0|): several blurred, noisy and down-sampled (low resolution) observations of a scene are 
available to retrieve the original (high resolution) scene ED, eD- 

^From any point let s > t be the closest next time where e is sampled, then for any non-negligible subset A £ IR^ x ©, we 

have , A) > 0. 


September 14, 2015 


DRAFT 



9 


The usual direct model reads: y = Ax + n = SHx + n. In this equation, y G collects the pixels of the 
low resolution images (five 128 x 128 images, i.e. M = 81920) and x G R^ collects the pixels of the original 
image (one 256 x 256 image, i.e. N = 65536). The noise n G R^ accounts for measurement and modeling errors. 
H is a TV X iV circulant-block-circulant convolution matrix accounting for the optical and the sensor parts of the 
observation system. Here it is a square window of 5-pixel-width. S is a M x A matrix modeling motion (here 
translation) and decimation: it is a down-sampling binary matrix indicating which pixel of the blurred image is 
observed. 

The chosen prior for the noise is n ~ Af(0,7“^I), i.e. uncorrelated. Regarding the object, the chosen prior 
accounts for smoothness: x ^ Af(0, 7 “^D*D) where D is the N x N circulant convolution matrix of the Laplacian 
filter. The hyperparameters 7 „ and 7 x are unknown and the assigned priors are conjugate : Gamma distributions 
In ^ G {ctn, Pn) and 7x ~ 1/ (cix; /3x)- They are poorly informative for large variances and uninformative Jeffreys’ 
prior when the (q;xj/?x) tends to (0,0). As a consequence, the full posterior pdf writes 


p(x, 7x, 7n|y) oc p(y|x, 7„)p(x|7x)p(7x)p(7n) 

„ an+N/2-l a^ + (M-l)/2-l 
^ /n /x 

exp [-7n|ly - SHx||^/2] exp [-^„7„] 
exp [-7x||Dxf/2] exp [-/3x7x] • 


The conditional law of the image writes 

P(x|y,7x,7n) OC exp 
Accordingly the negative logarithm gives the criterion 


>^7x,7n(x) = ylly - Axf -f yllDxf 

V-^ 7 x, 7 „ W = 7 nA‘(Ax - y) -f 7 xD‘Dx 
= Q(x — m) 


-|^||y-SHx|p-|^||Dx||2 


2 I 7x I 


and the gradient 


with m = 7 nA*y, and the Hessian 


Q7x. 7„ = '^^'^7x,7„(x) = 7nA*A -f 7 xD*D 


(4) 


B. Gibbs sampler 

The posterior pdf is explored by the proposed Gibbs sampler in Algorithmic based on the GSGS, that iteratively 
updates 7n, 7x and a subset of x. Regarding the hyperparameters, the conditional pdf are Gamma and their 
parameters are easy to compute. 

The set of mutually conjugate directions w.r.t. Q 7 x. 7 n’ at step |C of Algorithm [C is computed by the Gram- 
Schmidt process applied to gradient, as usually found in conjugated gradient optimization algorithm. The procedure 
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Algorithm 2 : GSGS for super-resolution. 

Set t = 1, define an initial point and repeat 

1 : Sample 7 ^*^ --p ( 7 n|y, as 

^(^’||y-SHx(‘-i)p) ■ 

and p (7x|y, x(‘“^)) as 

^ 2 ’ ||Dx(‘-i)||2y ■ 

2: Set Qt = Q^(t) ^(t) and compute the gradient 

(x(‘-i)) = Qt(x(‘-i) - m) 


3: Sample a perturbation ~ A/'(0, Qt) 

4: Compute a set of No mutually conjugate directions {di,..., d^v^} with the first being di = 
5: Sample independently the set {an)n=i,...,No with; 




M 


dng 


(t) 


1 


dnQtdyi d^Q^d^ 


6: Compute X^*) ^ x(* - Yln=l dn- 

7: f 3- f -f 1. 

until the stopping criterion is reached. 


is similar to the algorithm described in p^ . Finally the estimator is the posterior mean computed as the empirical 
mean of the samples. 

Despite the convergence proof with almost any law for the perturbation e (provided that the density p(e) is 
supported in R^), some tuning is necessary to practically obtain a good space’s exploration. In practice, the step[^ 


has a major influence and, as already discussed in section II-C we observe that a working perturbation corresponds 
to those of the PO algorithm | |^ 


-(t) _ At) 1/2 t 


- 7n 


A'e, 


'7x 




where ex are two Gaussian normalized random vectors, leading to a Gaussian perturbation e^*^ of covariance Qj. 
However, the proposed algorithm has numerous advantages over the PO algorithm. First the proposed algorithm 
has a convergence proof because it does not suffer from truncation, even in the extreme case with = 1. Second 
the perturbation has the sole constraint of having as support. Moreover a perturbation is not required at each 
iteration. 


C. Numerical results 

The posterior law Q has been explored with the following four algorithms or settings. 
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• The adaptive RJ-PO algorithm pO) , directly tuned with the acceptance probability, here chosen to be 0.9. This 
acceptance probability leads to an average number of around 150 iterations of conjugate gradients to compute 
the proposal, and with 6% of rejected samples. 

• Algorithm with Njj = 150. The idea is to build an algorithm close to RJ-PO’s computing time. 

• Algorithm 1^ with Nd = 10. The idea is to show that our algorithm offers the possibility to reduce the number 
of iterations while still offering a good exploration and with guaranteed convergence. We empirically found 
that Nn = 7 is the lower limit case to have a good global exploration including the hyperparameters. 

• Algorithm 1^ with Njy = 2. The idea is to show a very fast algorithm that offers a partially correct exploration. 
This case is particular in the sense that the perturbation is done only once for the whole algorithm. 

The posterior mean estimations (PM) of the high-resolution image are given in Fig. [T] as well as the posterior 
standard deviation (PSD). From these results we can say that all algorithms provide similar quality for the image 
estimation. The same statement can be made for the standard deviation. However the posterior standard deviation 
with No = 2 seems incorrect. A possible interpretation is that the perturbation vector e is simulated only once 
during the whole algorithm. Thus, the space is surely not sufficiently explored and the covariance estimation is 
severely biased. Indeed, since ex are drawn only once, the stochastic explorations are limited to the conjugate 
direction plus the two directions Sx and However the mean estimation does not seem to be affected and this 
algorithm is able to provide very quickly a good estimation of the image and hyperparameters value. 



Figure 1: Image results. 
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RJ-PO 

150 

10 

2 

7n 

0.9718 

0.9694 

0.9452 

0.7078 

O' * 7 ^^ 

0.0063 

0.0061 

0.0066 

0.3395 

7? 

1.07e-03 

1.06e-03 

1.95e-03 

9.62e-03 


3.7e-05 

1.7e-05 

2.7e-05 

6.2e-03 

loop [s. 

] 4.4 

2.4 

0.2 

0.1 

total [s. 

] 666 

353 

28 

9 

I: Hyper 

■ parameters values estimation with true 7 , 


RJ-PO 

10 

2 



9.9e-03 

9.9e-03 

9.9e-03 



07 ^ 6.8e-05 

4.8e-05 

5.5e-05 



^ 1.84e-03 3.28e-03 2.29e-03 

5^ 3.2e-04 7.0e-04 3.4e-05 


Table II: Hyper parameters values with true 7 „ = 0.01. 


The chains of the hyperparameters are illustrated in Fig. Figs. and [^represent the samples as function of 
the iteration. We observe that, except for Njy = 2, all the chains have the same behavior with the same convergence 
period. The Nj:, = 2 has slower convergence but reaches the same stationary distribution. 

Figs. 2b and [2d| represent the samples as function of time (in seconds). The chains for RJ-PO and GSGS with 
Nd = 150 have the same behavior. This result is obvious since both algorithms compute almost the same number 
of gradients per iteration. That said, we see that for Nn = 10 and Nd = 2, the impact on the convergence time 
is signihcant. The Tab. shows some quantitative results. In particular the case N]j = 10 is ten times faster than 
RJ-PO. 


In addition. Tab. [II] shows a hyperparameter values estimation with a higher noise level. Again the estimated 
values are close and the 7 ^ parameter is correctly estimated. 

To illustrate the effect of the perturbation for good space exploration. Fig. shows the results when no pertur¬ 
bations are done and with No = 10. In this case, the hypotheses of Proposition are no longer verihed and 
those of Proposition cannot be verihed in practice. Moreover, the results show that both the covariance and the 
hyperparameters are wrongly estimated. This effect leads to an over-regularized image. A possible explanation is 
that the conjugate directions of the GSGS explore in a privileged way the directions of small variance (highest 
eigenvalues of Q). 

Regarding the computational cost, all the presented algorithm are dominated by the cost of matrix-vector product 
Qx. The cost thus depends on the specihc problems and the structure of Q in the same way than for conjugate 
gradient algorithm. For super-resolution problems, the cost of the matrix-vector product is almost equal to two 
discrete Fourier transforms of images. That said, the total number of matrix-vector is related to Nn and the number 
of Gibbs iteration. 
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Figure 3; Results without perturbation and Njj = 20. 


The main concluding comment is that the proposed algorithm allows a great improvement in the convergence 
time of the Gibbs sampler while being convergent to the true joint posterior law. However the speed improvement 
can come with a bad covariance estimation if the number Njj of directions for the image x is not sufficient. 

IV. Conclusion 

The handling of high-dimensional law, especially Gaussian, appears in many linear inverse and estimation 
problems. With the growing interest in “Big Data” and non stationary problems this task becomes critical. Moreover, 
the uncertainty around the estimated values, or the confidence interval, remains one of the difficult points combined 
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with hyperparameter estimation for automatic method design. 

The main contributions of this paper is (i) the proposition of a new algorithm in the class of the Gibbs samplers, 
able to address the case of high-dimensional Gaussian conditional laws, and (ii) the convergence proof of the 
algorithm. It relies on a random excursion along a small set of directions instead of handling with the high 
dimensional distribution. The directions are appropriately chosen according to the gradient of the potential. 

This new algorithm is shown to be an efficient alternative to existing work like the PO-type algorithms; we 
ensure the theoretical convergence of the algorithm and, in some cases, we can show a drastic computing-time 
improvement. 

The convergence of the algorithm is proved, provided that a random perturbation around the gradient direction 
is introduced. Even if in theory the only condition to ensure convergence is to choose a perturbation distribution 
supported in the whole space, it appears in practice that the results are very sensitive to the choice of the distribution. 
Moreover, the choice of the Gaussian distribution Qg) is the only case where the algorithm is more efficient 
than the PO and RJ-PO algorithm. The objective of our further work will be to better understand this high sensitivity 
to the choice of the perturbation distribution, that is, at this time, an open problem. 

In further work the objective will be to study the convergence rate of the GSGS. In particular, the geometric 
ergodicity is an important property that ensures a fast convergence and allows us to give estimations of standard 
errors. The geometric ergodicity of Gibbs samplers has long been studied pT) and a lot of results are shown in 
the Gaussian case 0, as well as for application in Bayesian hierarchical models | [45) , also in the case of joint 
Gaussian and Gamma distribution 0, 0, the latter being close to our illustration example. 

Also, one has to choose the number Njj of mutually conjugate directions to sample at each iteration of the 
algorithm. In theory, this does not affect the convergence properties of the algorithm. As a perspective, one can 


propose an automatic choice of Nd, following the work in |40| for the RJ-PO. 

The proposed algorithm is somewhat independent of the chosen direction. The use of preconditioner to compute 
direction as in preconditioned conjugate gradient should improve the computational cost by an Nu parameter 
smaller than at the present time. It depends, however, on each addressed problem. 

This paper is focused on linear conditionally Gaussian models. By use of hidden variable, the algorithm should 
also be able to handle non Gaussian models that are still conditionally Gaussian. 


Appendix 

A. Proof of Proposition 

This appendix is devoted to prove Proposition It is mainly inspired by the proofs presented in | |28) (see 
also p^ , p^ ) for different random scan strategies in order to sample p(x|0). The only difference is that the 
random choice is not according to a set of coordinates of x in the canonical basis, but according to a mutually 
conjugate set with respect to a current matrix Qg. Therefore the same arguments as detailed in p8) can be used to 
prove the irreducibility: if the support of the density p{e) is R^, all the directions can be explored in one step of 
the algorithm. Therefore any y G R^ can be reached in one step by taking, for example, di = — y, 5i = 1, 
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= 0, n = 2,..., Nd. Using classic continuity arguments, we can deduce that the probability of reaching any 
open ball B{y, r), centered in y of radius r, conditional to any current point is strictly positive, which ensures 
the chain to be irreducible. 

The rest of the proof focuses on the fact that p(x, 0) is an invariant probability of the chain. We use the same 
arguments and notations of 1^ . Let x G and a set D of mutually conjugate directions with respect to a dehnite 
positive matrix Q. We decompose x = {xd,x\o) which is always possible as explained in section [ii-Aj 

Dehne (x',0') G R^ x 0 a current point and (x',0') G R^ x 0 the point obtained by Algorithm with the 
transition Kernel: 

P{x,e\x!,e') = 7r(0|x',0')7r(x£,|x\£,,x',0)(5(x\£, - x^^,) 

with TT denoting any conditional probability and 5 is the Dirac function. The objective is to show that if (x', 9') is 
distributed according to the joint distribution p, then (x, 0) is also distributed according to p. 

Let A C R^ be a measurable set. The following lines are the result of the definition of the transition Kernel, 
the use of the general product rule, and of sequential integration with respect to 9 ^'d and x(^^: 


P((x,0)G A) 

= J 1a{x,9)P{x,9\x', 9')p{x',9')dxd9dx'd9' 

= J 1a(x, 9)tt{9\x', 9')Tr{xD\x\D, x', 9) ■ ■ ■ 

. .. S{x\]j — x{^jy)p{x', 0')dxd0dx'd0' 

= J 1a(x, 0)p(x', 9)Tr{xD\x\D, x', 9)... 

... S{x\D — x(^^)dxd0dx' 

= J 1a(x, 9)p{x[jj,9)Tr{xD\x\D,^\D^ ^) ■ ■ ■ 

... 5 { x\d - X\^)dxd0dx\£, 

= J 1 a ( x , 9)p{x\D,S)n{xD\x\D,S)dxd9 

= J lA{x,9)p{x,9)dxd9 

Hence the joint probability p(x, 9} is an invariant probability of the Markov chain produced by Algorithm 


B. Proof of Proposition 

This appendix is dedicated to prove Proposition]^ Let (x(°\ 0*-°^) G R^ x 0 be a current point and (x^*\ 0^*^) 
the point produced by the chain of Algorithm [T] at iteration t. The objective is to prove that for any non-negligible 
subset A C R'^ x 0, there is T > 0 such as P((x('^\ 9^'^^) G > 0. Using the hypothesis H-2, it is 

sufficient to prove that for any non-negligible subset A^ G R^, there is T > 0 such as: 

P(x(^) G > 0 (5) 
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Given we denote by 6 the corresponding element that respects conditions H-3. It is sufficient to prove the 
Proposition in the following framework: 

F -1 0 (^+ 1 ) = = ... = = e, 

F-2 mg = 0, 

F-3 Qe = diag(gi,..., gw) is diagonal. 

Indeed, if we prove the inequality (|^ with fixed 6 for N + 1 iterations, continuity arguments using conditions H-1 
and H-2 will end the proof of the Proposition. The simplihcations F-|^ and F-[^ can be assumed by a change of 
variable — me and by considering the basis of formed by the eigenvectors of Qe. 

In this simplified framework, the chain of Algorithm produces x^*\ t = 1,..., + 1, such as: 

x« = (I - a(*)Qe)(I - a(‘-i)Qe)... (I - 

with I the identity matrix in R^ and, noting x = {xi ,..., xnY, we have, for n = 1,..., A^: 

xW = (1 - ... (1 - ( 6 ) 

The hypothesis H-3.2 ensures that xi°^ 7 ^ 0, n = 1,..., A^, therefore we can assume without loss of generality that 
x^°^ = 1 , n = 1 ,..., A^, and equation (|^ is, in this case: 

x£‘) = (1 - ... (1 - (7) 

The following Lemma proves that any point in R^ can be reached by the chain in A^ + 1 iterations. 

Lemma 1. For any y G R^, there is a = such as = y, where is defined by 

(|7| with t = N + 1. 

Proof: This can be done by interpreting it as an interpolation problem: given y G R^, the objective is to show 
that there is a polynomial such as: 

= yn, n = l,...,7V ( 8 ) 

= 1 (9) 

with dehned by the right hand side of Q with f = A^ + 1. The constraint (|^ is due to the specihc form of 

Pa^^- Also the fact that the parameters must be real, implies that the polynomial P^^^ must have only real 
roots. It is well known that there is a polynomial of degree N that respects ([^ and (|^. Let us denote by Q such a 
polynomial. But the roots of Q may be complex. However we can show that there is a polynomial of degree A^ +1 
with real roots that respects the conditions (|^ and (|^. Indeed, let us consider the polynomial Q and a polynomial 
R of degree A^ -|- 1 such as R{qi) = R{q 2 ) = ■ ■ ■ = R{qN) = 7?(0) = 0. Therefore any polynomial P^ = Q + tR, 
T G R, respects conditions ([^ and and it is trivial to show that for r* sufficiently large, the polynomial P^* 
has all its roots r* G R, n = 1,..., A^. Therefore, taking P^^^ = Pt*, i-e. = l/r* ends the proof of the 
lemma. ■ 

Using this lemma and the continuity of P^^^ with respect to a, it is trivial to prove Q and then the Proposition. 
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